Data is Wicked


Picture this: A data scientist spends hours configuring YAML files, debugging pipeline errors, and switching between five different tool UIs just to test a simple feature idea. Meanwhile, an analyst in Excel has already cleaned their data, built a model, and presented insights to stakeholders. Something’s wrong with this picture – and it’s not the analyst using Excel.

The Hidden Complexity in Data Science Work

Data science is inherently complex, requiring collaboration across diverse roles: data scientists producing insights, engineers making data available, and domain experts translating real-world problems into data. But our current tools often add unnecessary complexity to this already challenging work.

Consider these real-world scenarios:

The Modern Data Science Experience: Chloe’s Story

Chloe, a data scientist, is improving a book recommendation system. Her “modern” stack includes a feature store, model store, and evaluation store.

[Image: ML infrastructure stores. Credit: Aparna Dinakar]

Here’s her typical workflow:

  1. She starts by checking model metrics across two separate UIs
  2. To understand poor recommendations, she digs through the evaluation store
  3. When she spots a potential improvement, she prototypes in a notebook
  4. To productionize her feature, she must:
    • Rewrite notebook code into a specific format
    • Create YAML configurations
    • Debug deployment issues
    • Wait 20+ minutes for feature computation
  5. After all this infrastructure work, she discovers a bug in her feature
  6. She starts the entire process over again

The Analytics Experience: Larry’s Story

Larry, an analyst investigating customer churn, has a radically different experience with Excel:

  1. He loads customer order data directly into Excel
  2. When he spots data issues, he fixes them immediately using built-in functions
  3. He visualizes patterns by selecting columns and creating charts
  4. When he needs marketing data, he imports it and joins it right there
  5. He iterates on his analysis in real-time, cleaning and modeling as he goes
  6. He produces insights ready for stakeholder review

Both are solving data problems. Both need to clean data, analyze patterns, and build models. Yet their experiences couldn’t be more different. Chloe spends most of her time wrestling with infrastructure, while Larry spends his time solving the actual problem.

Why Data Science is a “Wicked Problem”

Data science challenges share key characteristics with what design theorists Horst Rittel and Melvin Webber called “wicked problems”:

  1. The solution shapes the problem: How we model data influences what solutions we can discover
  2. Multiple stakeholder perspectives: Each role brings different priorities and constraints
  3. Evolving constraints: Requirements and resources change as understanding deepens
  4. Never truly finished: Solutions require continuous refinement and adaptation

This isn’t just academic theory – it explains why our current tooling often falls short. We’ve optimized for production stability while sacrificing the iterative nature of data science work.

Consider this example of modern ML infrastructure:

[Image: Kubeflow pipeline example]

Is this really how data scientists want to work?

What Makes Excel Work (Despite Its Flaws)

Excel has serious limitations:

  • Limited scale
  • Poor version control
  • Lack of audit trails
  • Potential stability issues

Yet people keep using it. As Gavin Mendel-Gleason aptly notes:

“People refuse to stop using Excel because it empowers them and they simply don’t want to be disempowered.”

Excel succeeds because it provides two critical capabilities:

  1. Iterability: Users can rapidly experiment, seeing results immediately
  2. Accessibility: Anyone can view and work with the data at their skill level

In Fred Brooks’ terms, Excel prioritizes solving the essential problem (understanding data) over accidental complexity (infrastructure concerns).

Building Better Data Science Tools

To create better tools for data science, we need to learn from Excel’s success while addressing its limitations. Future tools should:

1. Embrace Accessibility Through Layered APIs

  • Provide simple interfaces for basic operations
  • Enable progressive complexity for advanced users
  • Support different user roles and skill levels
  • Example: FastAI’s layered API approach, allowing both high-level and low-level control
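To make the layered-API idea concrete, here’s a minimal sketch in Python. It is inspired by, but not identical to, fastai’s design; all function names are hypothetical. The high-level `encode` call does everything in one line, while the lower layers (`build_vocab`, `tokenize`) stay exposed for advanced users:

```python
def tokenize(text: str) -> list[str]:
    # low-level building block: simple whitespace tokenizer
    return text.lower().split()

def build_vocab(docs: list[str]) -> dict[str, int]:
    # mid-level component: assign each token a stable integer id
    vocab: dict[str, int] = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(docs: list[str]) -> list[list[int]]:
    # high-level entry point: one call handles the whole pipeline
    vocab = build_vocab(docs)
    return [[vocab[t] for t in tokenize(d)] for d in docs]

# Beginners use the top layer...
ids = encode(["Hello world", "world again"])
# ...while experts drop down a layer to customize vocabulary handling.
vocab = build_vocab(["Hello world"])
```

The key design choice is that each layer is a plain, documented function rather than a hidden internal, so “progressive complexity” means peeling back one layer, not rewriting everything.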

2. Enable Rapid Iteration

  • Minimize context switching between tools
  • Provide immediate feedback on changes
  • Allow experimentation without heavy reconfiguration
  • Example: Databricks’ notebook-centric workflow that scales to production
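One cheap way tools deliver immediate feedback is by caching expensive steps so that each experiment only pays for what changed. A rough sketch of the idea (not any specific product’s API), using Python’s built-in memoization:

```python
import functools

@functools.lru_cache(maxsize=None)
def compute_feature(version: str) -> tuple[float, ...]:
    # stand-in for an expensive feature computation (the kind that
    # took Chloe 20+ minutes per iteration)
    return (0.1, 0.2, 0.3)

def evaluate(feature: tuple[float, ...]) -> float:
    # stand-in for a quick evaluation metric
    return sum(feature) / len(feature)

# The first call pays the full cost; every subsequent experiment on
# the same feature version reuses the cached result instantly.
score = evaluate(compute_feature("v1"))
```

In a real system the cache key would be a content hash of the code and inputs, but the workflow benefit is the same: the iteration loop shrinks from minutes to seconds.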

3. Meet Users Where They Are

  • Build SDK-first instead of UI-first
  • Integrate with existing workflows (notebooks, CLI, etc.)
  • Support multiple interfaces for different use cases
  • Example: Netflix’s Papermill for productionizing notebooks
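The “SDK-first” principle can be sketched in a few lines (all names here are hypothetical, not Papermill’s API): the core logic lives in one plain, importable function, and the notebook, CLI, and any UI are thin wrappers around the same call:

```python
def churn_rate(orders: list[dict]) -> float:
    """Core SDK function: callable from a notebook, script, or service."""
    if not orders:
        return 0.0
    return sum(1 for o in orders if o.get("churned")) / len(orders)

def cli_main(argv: list[str]) -> str:
    # CLI wrapper: in a real tool this would parse argv and load data;
    # here it just calls the same SDK function on demo data.
    demo = [{"churned": True}, {"churned": False}, {"churned": False}]
    return f"churn rate: {churn_rate(demo):.0%}"

# In a notebook, the identical function is used directly:
rate = churn_rate([{"churned": True}, {"churned": False}])
```

Because every interface funnels into one function, fixing a bug or adding a feature in the SDK improves the notebook, the CLI, and the UI at once.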

4. Focus on Essential Problems

  • Solve data understanding challenges first
  • Handle infrastructure concerns behind the scenes
  • Provide sensible defaults with room for customization
  • Example: Modern data warehouses abstracting away distribution complexity
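“Sensible defaults with room for customization” might look like the following sketch (a hypothetical API, not any real warehouse client): the zero-configuration path just works, and every knob remains reachable for users who need it:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryConfig:
    timeout_s: int = 30        # safe default for interactive use
    max_rows: int = 10_000     # protects clients from huge results
    warehouse: str = "default" # infrastructure choice hidden by default

def run_query(sql: str, config: Optional[QueryConfig] = None) -> dict:
    cfg = config or QueryConfig()  # zero-config path just works
    # stand-in for real execution; returns the effective settings
    return {"sql": sql, "timeout_s": cfg.timeout_s, "max_rows": cfg.max_rows}

# Simple case: no configuration at all.
result = run_query("SELECT 1")
# Advanced case: override only the one setting you care about.
tuned = run_query("SELECT 1", QueryConfig(timeout_s=300))
```

The design choice worth copying is that customization is additive: advanced users override one field, rather than being forced to specify everything (as in Chloe’s YAML files) before anything runs.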

The Path Forward

The future of data science tooling isn’t about adding more infrastructure – it’s about making complex infrastructure invisible while empowering users to solve real problems. We need tools that:

  1. Scale with complexity: Start simple, but support sophisticated use cases
  2. Enable collaboration: Make it easy for different roles to contribute
  3. Promote iteration: Support rapid experimentation and refinement
  4. Abstract wisely: Hide unnecessary complexity while maintaining power

Key Takeaways for Tool Builders and Users

For tool builders:

  • Prioritize user workflow over infrastructure
  • Build layered interfaces that grow with users
  • Focus on reducing cognitive overhead

For practitioners:

  • Evaluate tools based on iteration speed
  • Look for solutions that scale with your needs
  • Don’t accept unnecessary complexity

Data science will always involve complex problems. But our tools should help us tackle that essential complexity rather than adding accidental complexity. The next generation of data science tools must learn from Excel’s strengths while overcoming its limitations.

Remember: The best tool isn’t always the most sophisticated – it’s the one that helps you solve real problems most effectively.


Is Your Data Wicked?

If this post resonated with you, I would love to chat with you. Schedule a free consult to talk about how to tame your wicked data.