Tathagata Das is an Apache Spark committer and a member of its Project Management Committee. He is the lead developer behind Spark Streaming and currently develops Structured Streaming at Databricks. Previously, he was a grad student at the UC Berkeley AMPLab, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.
Abstract: Structured Streaming is a new stream processing engine built on Spark SQL that has revolutionized how developers write stream processing applications. It enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs, including DataFrames, Datasets and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
In this session, Tathagata Das will walk through the basic concepts of Structured Streaming and a concrete example in which, in fewer than 10 lines, you read from Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad hoc queries on up-to-the-last-minute data. The session will also take a quick look at event-time aggregations, sessionization, and other advanced operations.
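To make the idea of running a batch-like query incrementally a little more concrete, here is a minimal pure-Python sketch of an event-time windowed count that is updated one micro-batch at a time and tolerates late, out-of-order events up to a fixed delay. It is only a toy illustration of the concept, not Spark's implementation or API: the names `WINDOW`, `MAX_DELAY`, `batch_counts` and `IncrementalCounts` are hypothetical, and the watermark rule is a simplified assumption loosely modeled on the behavior the abstract describes.

```python
from collections import defaultdict

WINDOW = 10     # tumbling window length in seconds (illustrative value)
MAX_DELAY = 5   # how far behind the newest event a late event may be (illustrative)


def window_start(ts):
    """Start of the tumbling window that contains event time ts."""
    return ts - ts % WINDOW


def batch_counts(events):
    """Batch query: count events per (window, key) over a complete dataset."""
    counts = defaultdict(int)
    for ts, key in events:
        counts[(window_start(ts), key)] += 1
    return dict(counts)


class IncrementalCounts:
    """The same query, maintained incrementally across micro-batches."""

    def __init__(self):
        self.counts = defaultdict(int)  # running aggregation state
        self.max_event_time = 0         # newest event time seen so far
        self.dropped = 0                # events that arrived too late

    def process(self, micro_batch):
        # Assumption: the watermark is fixed at the start of each micro-batch
        # from the newest event time seen in earlier batches.
        watermark = self.max_event_time - MAX_DELAY
        for ts, key in micro_batch:
            if window_start(ts) + WINDOW <= watermark:
                self.dropped += 1       # window already closed: ignore event
                continue
            self.counts[(window_start(ts), key)] += 1
        self.max_event_time = max(
            [self.max_event_time] + [ts for ts, _ in micro_batch]
        )
        return dict(self.counts)


# Events are (event_time_seconds, key); (4, "a") is late but accepted,
# (2, "a") arrives after its window has been closed by the watermark.
micro_batches = [
    [(1, "a"), (3, "b")],
    [(12, "a"), (4, "a")],
    [(25, "b")],
    [(2, "a")],
]

stream = IncrementalCounts()
for mb in micro_batches:
    result = stream.process(mb)

# Events that were on time, gathered into one static dataset.
on_time = [(1, "a"), (3, "b"), (12, "a"), (4, "a"), (25, "b")]
print(result == batch_counts(on_time))  # True
```

The point of the comparison at the end is that the incrementally maintained result matches what the batch function returns on the same on-time data, which is the property that lets the query be written as if the data were static.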