Jupyter has revolutionized data science, and it started with a chance meeting between two students

Commentary: Jupyter makes it easy for data scientists to collaborate, and the open source project’s history reflects this kind of communal effort.

If you want to do data science, you’re going to have to become familiar with Jupyter. It’s a hugely popular open source project that is best known for Jupyter Notebooks, a web application that allows data scientists to create and share documents that contain live code, equations, visualizations and narrative text. This has proven to be a great way to extract data with code and collaborate with other data scientists, and Jupyter has boomed from roughly 200,000 Notebooks in use in 2015 to millions today.

Jupyter is a big deal, heavily used at companies as varied as Google and Bloomberg, but it didn’t start that way. It started with a friendship. Fernando Pérez and Brian Granger met on their first day of graduate school at the University of Colorado Boulder. Years later, in 2004, they discussed the idea of creating a web-based notebook interface for IPython, which Pérez had started in 2001. This became Jupyter, but even then, they had no idea how much of an impact it would have within academia and beyond. All they cared about was “putting it to immediate use with our students in doing computational physics,” as Granger noted.

These things take time

Today Pérez is a professor at the University of California, Berkeley, and Granger is a principal at AWS, but in 2004 Pérez was a postdoctoral researcher in Applied Math at the University of Colorado Boulder, and Granger was a new professor in the Physics Department at Santa Clara University. As mentioned, they first met as students in 1996, and both had been busy in the interim. Perhaps most pertinently to the rise of Jupyter, in 2001 Pérez started dabbling in Python and, in what he calls a “thesis procrastination project,” he wrote the first IPython over a six-week stretch: a 259-line script now available on GitHub (“Interactive execution with automatic history, tries to mimic Mathematica’s prompt system”).

It would be tempting to assume this led to Pérez starting Jupyter. It would also be incorrect. The same counterfactual leap could occur if we remember that Granger wrote the code for the actual IPython Notebook server and user interface in 2011. This was important, too, but Jupyter wasn’t a brilliant act by any one person. It was a collaborative, truly open source effort that perhaps centered on Pérez and Granger, but also included people like Min Ragan-Kelley, one of Granger’s undergraduate students in 2005, who went on to lead development of IPython Parallel, which was deeply influential in the IPython kernel architecture used to create the IPython Notebook.

However we organize the varied people who contributed to the origin of Jupyter, it’s hard to get away from “that one conversation.”

In 2004 Pérez visited Granger in the San Francisco Bay Area. The old friends stayed up late discussing open source and interactive computing, and the idea to build a web-based notebook came into focus as an extension of some parallel computing work Granger had been doing in Python, as well as Pérez’s work on IPython. According to Granger, they half-jokingly talked about these ideas having the potential to “take over the world,” but at that point their idea of “the world” was somewhat narrowly defined as scientific computing within a mostly academic context. 

Years (and a great deal of activity) later, in 2009, Pérez was back in California, this time visiting Granger and his family at their home in San Luis Obispo, where Granger was now a professor. It was spring break, and the two spent March 21-24 collaborating in person to complete the first prototype IPython kernel with tab completion, asynchronous output and support for multiple clients.

By 2014, after a great deal of collaboration between the two and many others, Pérez, Granger and the other IPython developers co-founded Project Jupyter and rebranded the IPython Notebook as the Jupyter Notebook to better reflect the project’s expansion outwards from Python to a range of other languages including R and Julia. Pérez and Granger continue to co-direct Jupyter today.

Theory of scientific revolutions

“What we really couldn’t have foreseen is that the rest of the world would wake up to the value of data science and machine learning,” Granger stressed. It wasn’t until 2014 or so, he went on, that they “woke up” and found themselves in the “middle of this new explosion of data science and machine learning.” They just wanted something they could use with their students. They got that, but in the process they also helped to foster a revolution in data science. 

How? Or, rather, why has Jupyter helped to unleash so much progress in data science? Rick Lamers explained:

Jupyter Notebooks are great for hiding complexity by allowing you to interactively run high level code in a contextual environment, centered around the specific task you are trying to solve in the notebook. By ever increasing levels of abstraction data scientists become more productive, being able to do more in less time. When the cost of trying something is reduced to almost zero, you automatically become more experimental, leading to better results that are difficult to achieve otherwise.

Data science is…science; therefore, anything that helps data scientists to iterate and explore more, be it elastic infrastructure or Jupyter Notebooks, can foster progress. Through Jupyter, that progress is happening across the industry in areas like data cleaning and transformation, numerical simulation, exploratory data analysis, data visualization, statistical modeling, machine learning and deep learning. It’s amazing how much has come from a chance encounter in a doctoral program back in 1996.

Disclosure: I work for AWS, but the views expressed herein are mine.
