13 Upcoming Strata Sessions that I’m Excited About

For data-obsessed technologists and business professionals, Strata is the conference to attend. AI and machine learning have taken quantum leaps in the last few years. It is no surprise that this year’s Strata sessions have a specific focus on these.

Organizations today collect more data, about more things, than ever before. However, collecting data and making sense of it to drive business decisions are two entirely different challenges.

Hence, this year’s Strata focuses on sessions that tackle specific data challenges. These range from architecting an enterprise big data analytics platform to hands-on sessions such as deploying machine learning models for production use.

From an exhaustive agenda that spans three days, here are 13 sessions that I’m looking forward to.

1. Architecting a data platform for enterprise use

Hosted by two industry veterans, this session looks at the design principles for creating a multi-use data infrastructure. More importantly, it focuses on data infrastructure that is forward-looking and is not held back by past constraints. Most organizations acquire technology piecemeal and mostly ignore the broader IT landscape within which this technology resides.

Meant for architects, system designers, and managers, this session will walk attendees through a reference multi-use data and analytics architecture.

Why this session interests me: For a real competitive advantage, enterprises need to adopt a robust big data analytics system. As a consultant, I advise my clients on how to set up their infrastructure for the production use of analytics. Hence, this session would help me learn the fundamental design principles to do that. The reference architecture walkthrough will be particularly useful.

2. Running multidisciplinary big data workloads in the cloud

Split into two parts, this Strata session first educates attendees on what it takes to run a data analytics pipeline in the cloud successfully. In the second part, the presenters will lead a live, hands-on exercise in which attendees set up a data analytics pipeline in the cloud that integrates with data engineering and warehousing workflows.

Why this session interests me: Cloud infrastructure, especially in the serverless paradigm, allows for efficient use of resources to achieve analytics and other business goals. However, with the advent of large data volumes, the movement of data from one location to another is a significant challenge. It is not always possible to take computing to where data exists. As a result, I am hoping this session will impart expert opinion on some of the solutions that help solve these challenges.

3. Automating DevOps for Machine Learning

Data scientists understand all too well the hassles of deploying their ML models into production. These challenges include incompatible languages and environments, erroneous processes, and poor communication.

Targeted at data scientists, DevOps professionals, and machine learning engineers, this session aims to help attendees understand common deployment architectures for machine learning models in the real world.

Why this session interests me: I need to advise clients on how to set up production infrastructure for the models they have built. We have multiple engagements where we are helping our clients move from an in-house developed model to a production pipeline. This deployment is not a straightforward exercise as there are several complications. I’m hoping this session will help me glean insights on how to accelerate deployments of these machine learning models.

4. Yay, we are going to deploy an AI/ML-powered app. But wait! Where do I deploy?

The number of AI- and ML-powered applications is growing rapidly. With this growth, organizations often wonder about the deployment options for these applications. Large enterprises lean towards multi-cloud hybrid platforms. However, smaller companies have to decide whether to go with a full or hybrid cloud environment.

This session talks about some of these challenges and their possible solutions.

Why this session interests me: Continuing on the topic of deploying AI/ML-based applications, not all deployment solutions are the same, and it is dangerous to lock in to a cloud vendor that lacks the features you need. Hence, this session would help me get a proper perspective, from an experienced presenter, on the nuances involved in this decision-making.

5. Real-Time Analytics at Uber: bring SQL into everything

Big data is everywhere in the world’s largest ride-hailing company. However, moving this data between sources is expensive and time-consuming. Hence, analysts and engineers at Uber have to run SQL queries in real time on all data sources.

This Strata session elaborates on Uber’s engineering efforts in running SQL queries in real time on all data sources, without copying any data, using Hadoop, Spark, and Presto.

Why this session interests me: Machine learning has evolved beyond the basic MapReduce paradigm to a SQL paradigm. However, not all SQL adapters for big data sources are the same, and there is no clear winner for now. Even so, Uber’s experience should offer useful lessons.

6. Apache Superset: An open-source data visualization platform

Originally developed at Airbnb, Apache Superset is an enterprise-ready business intelligence web application. It can be used to craft beautiful dashboards with the ability to slice and dice data.

In this session, the presenter offers an overview of Superset. He is scheduled to discuss its underlying technology, open source development dynamics, and the essential items on its roadmap.

Why this session interests me: The end goal of any analytics initiative is to visualize data. We now have a solution specifically designed for this. Some of our projects exist solely to build such visualizations. I’d keep an eye on Superset and its development.

7. Talking to the machines: Monitoring production machine learning systems

Production machine learning models require constant monitoring, based on feedback and labels, to ensure that their inferences are correct. However, labels are not always available, or are too expensive and resource-intensive to obtain.

This session discusses the design and implementation of a real-time automated system to monitor production machine learning models. The idea is to detect system anomalies such as spurious false positives, gradual drifts, and the inability of the model to grasp the target concept.

Why this session interests me: Declining recommendation quality over time, bias creep, and model poisoning are significant challenges for production ML models. A real-time automated monitoring system would allow a model to be fine-tuned more rapidly and help it stay on track with its inferences.
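
To make label-free drift monitoring a little more concrete, here is a minimal sketch. It is not the presenters' system; it assumes one common approach, the Population Stability Index (PSI), which compares the distribution of recent model scores against a reference window without needing any labels. The function names, the ten bins, and the 0.2 alert threshold are illustrative assumptions, and scores are assumed to lie in [0, 1].

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], recent: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""

    def histogram(scores: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for s in scores:
            # Clamp the bin index so that 0.0 and 1.0 land in the first and last bins.
            idx = max(0, min(int(s * bins), bins - 1))
            counts[idx] += 1
        total = len(scores)
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / total, 1e-6) for c in counts]

    ref, rec = histogram(reference), histogram(recent)
    return sum((q - p) * math.log(q / p) for p, q in zip(ref, rec))


def drifted(reference: Sequence[float], recent: Sequence[float],
            threshold: float = 0.2) -> bool:
    """Flag drift when PSI exceeds an assumed rule-of-thumb threshold."""
    return psi(reference, recent) > threshold


# Reference scores spread evenly over [0, 1); recent scores bunched near 0.9.
reference_scores = [i / 100 for i in range(100)]
recent_scores = [0.9 + i / 1000 for i in range(100)]
alert = drifted(reference_scores, recent_scores)
```

In a real deployment the recent window would slide over live predictions, and an alert would trigger a closer look or a retraining job rather than an automatic rollback.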

8. The future of machine learning is decentralized

Connected devices today have computational and storage capabilities. Using federated learning, these connected devices can now securely learn ML models while retaining all data locally. Federated learning puts the user back in control of the data while letting developers build intelligent applications using that data. It also provides a path for personalization at scale while reducing costs and risks associated with sensitive data handling.

This Strata session aims to introduce federated learning, how it differs from the traditional centralized approach, and the use cases it fits best. The presenter will also demonstrate a federated learning model in TensorFlow.

Why this session interests me: Federated machine learning takes machine learning to the data source, while traditional machine learning algorithms work only with aggregated data. This is new to me, and hence useful.
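
As a rough illustration of the idea, here is a minimal sketch of federated averaging in plain Python. It is not the TensorFlow demo from the session. It assumes a toy one-parameter model, y = w * x, and three hypothetical devices whose data never leaves them; only the locally trained weight is sent back to the server, which averages the weights in proportion to each device's sample count.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs that stay on the device


def local_update(w: float, data: Data, lr: float = 0.05, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for y = w * x on one device's data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, clients: List[Data]) -> float:
    """One round: each client trains locally, the server averages the weights."""
    updates = [(local_update(w_global, d), len(d)) for d in clients]
    total = sum(n for _, n in updates)
    # Weight each client's result by how many samples it trained on.
    return sum(w * n for w, n in updates) / total


# Hypothetical devices; every pair satisfies y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)
```

After the rounds complete, w ends up close to 3, even though the server only ever saw weights and never the raw (x, y) pairs.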

9. Flink SQL in Action

In the past, stream processing frameworks mostly provided Java- or Scala-based APIs. Today, however, stream processing is rapidly gaining enterprise adoption, and it can now be done using SQL. This SQL-based approach makes stream processing accessible to non-programmers, enabling them to carry out everyday tasks.

Why this session interests me: Similar to the session on bringing SQL into everything, machine learning has evolved beyond the basic MapReduce paradigm to a SQL paradigm. Not all SQL adapters into big data sources are the same, and there is no clear winner for now. Apache Flink has been around for a while, and it is evolving quickly. This is intriguing at the very least.

10. Loosely coupled data with Apache Arrow Flight

The number of data tools has risen exponentially in recent years. While powerful, they have not been able to interconnect, especially when they have legacy interfaces such as ODBC, JDBC, and REST. Apache Arrow attempts to solve this problem by enabling these systems to interchange common representations of data through in- and near-process communications.

For more complex and distributed topologies, however, Arrow Flight allows data to be communicated in large parallel streams. Using a high-performance protocol and a set of libraries, Arrow Flight lets engineering professionals quickly build data services that move data between systems at high speed.

This Strata session gives a detailed walkthrough of Arrow Flight including its components, performance benchmarks, and live use cases. It also aims to showcase its opportunities for growth and its role in the concept of data micro-services.

Why this session interests me: Not all SQL adapters into big data sources are the same, and there is no clear winner for now. Apache Flink has been around for a while, and it is evolving quickly. That is intriguing at the very least.

11. Bullet: Querying Streaming Data in Transit with Sketches

Bullet is an open source query system that can query any streaming data set without storing this data. It is forward-looking and can run queries on an unbounded data set. In this Strata session, the presenters will talk about the innovative architecture of Bullet. The session also includes a live demo on a real, high-volume data set, and covers common implementation challenges and how Sketches fit into the solution.

Why this session interests me: This is new learning. Querying data streams in real time has always been a source of grief for developers, until now. Bullet aims to filter, project, and aggregate data in transit, which is very interesting.
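
To give a feel for how a sketch can aggregate a stream without storing it, here is a minimal example of one well-known sketch, the Count-Min sketch, applied to events that are filtered and projected one at a time. This is not Bullet's implementation or API. The class, the `us_pages` helper, the event fields, and the table sizes are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator


class CountMinSketch:
    """Fixed-size frequency summary; estimates can overcount but never undercount."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The smallest counter across rows is the one least inflated by collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def us_pages(events: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Filter to US events and project only the page field, one event at a time."""
    for event in events:
        if event["country"] == "US":
            yield event["page"]


# A tiny stand-in for an unbounded stream.
stream = [
    {"page": "home", "country": "US"},
    {"page": "home", "country": "IN"},
    {"page": "cart", "country": "US"},
    {"page": "home", "country": "US"},
]

sketch = CountMinSketch()
for page in us_pages(stream):
    sketch.add(page)  # Each event is folded into the sketch, then discarded.
```

Memory use stays fixed at width times depth counters no matter how many events pass through, which is the property that makes sketches attractive for unbounded streams. The trade-off is that hash collisions can inflate an estimate.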

12. Database migrations don’t have to be painful, but the road will be bumpy

Data migrations are hardly ever painless and smooth, especially when downtime is not an option and the database is a 200-node cluster with hundreds of terabytes of data. In this session, the presenters discuss an active/passive solution that makes such a migration possible using an extensible database client.

Aimed at DevOps and engineers, this session presents learnings from such a migration.

Why this session interests me: Ever since companies started collecting data from all possible sources, data storage requirements and solutions have kept evolving. Cost and performance remain significant factors in such migrations. It will be great to understand the learnings and any novel solutions for a typical data migration.

13. Model Governance in the Enterprise

The number of ML models that make business decisions is proliferating. However, it is hard to tell why a model has made a particular decision, if there is a bias in it, or if the dataset is being slowly poisoned.

This Strata session talks about these and other challenges in scaling machine learning models.

Why this session interests me: Machine learning models are being increasingly used to influence and in some cases make decisions that impact human lives, apart from profit. In this regard, it is necessary to maintain a strict practice of monitoring and tuning machine learning models that are running in a production environment. This talk promises to be exciting, as experienced people give their opinions on the subject.


About the author

Mukund Rajamannar

Mukund is the Director of Engineering at Synersip. Building products is his passion. He has designed and architected product roadmaps with teams to build horizontally scalable and performant product solutions on service-oriented architecture principles. Additionally, Mukund has built development teams focused on continuous engagement and delivery using open source and cloud services. He enjoys helping customers realize their Data Science, Big Data, and AI product dreams in a cost-efficient manner.
